(54) Method for searching in large databases of automatically recognized text

(57) A method for searching for a query word in a
database of automatically recognized text generated,
for example, by an optical character recognition (OCR)
system or a speech recognition (SR) system finds
entries that most closely match the query word. The
database is indexed into a trie data structure, which
represents all possible words in the database. The trie data
structure has a plurality of branch nodes, each
representing a letter of at least one word, and a plurality of
leaf nodes, each representing a respective word. The
trie data structure is searched for each query word by
selecting the first letter of the query word and also
selecting a root node in the trie data structure as the
current node. All possible child nodes of the current
node are identified. Respective estimated probability
values for matching respective letters of the query word
with the letters associated with the nodes in the path
taken through the trie data structure are calculated for
each identified child node. The identified child nodes
are then placed into a list of candidate nodes. The node,
in the list of candidate nodes, having the highest
probability value is selected as the current node and is then
deleted from the list of candidate nodes. The process
repeats with this current node until a leaf node is
reached. When a leaf node is reached, a determination
is made whether to store the word into a list of best
matches based on the probability value of the word
compared to the probability values for all the words in
the list of best matches.

Description



TECHNICAL FIELD



[0001] The present invention relates generally to searching for query keywords in large databases of
automatically recognized text and, more specifically, to a method for searching using approximate matching techniques
which compensate for errors in text generated by an optical character recognition (OCR) system or a speech
recognition (SR) system. The invention indexes the databases in a way that facilitates and speeds up the retrieval process.

BACKGROUND OF THE INVENTION

[0002] Large text retrieval systems are often built by extracting text information from documents using OCR, or from
spoken words using a speech recognition system, and inserting the extracted text into a database. OCR and SR
devices are prone to errors and hence the database may contain erroneous words. This makes it difficult to retrieve
documents that contain query words given by a user.

[0003] Today, it is common practice to store a large number of paper documents in a database. These documents
need to be retrieved later by searching for some words that appear in the document. One way to achieve this is by
extracting words from digitized documents using OCR technology. These words are then stored into the database and
are used for searching and retrieval.
[0004] In addition, speech recognition systems are becoming more widely used. With a commercially available
speech recognition system, a user may dictate a new document for the database or convert an existing document from
text to machine readable form by simply reading the document into a microphone connected to a personal computer.
While these systems are generally accurate with sufficient training, it is well known that some errors do occur. Typically,
the speech recognition system allows these errors to be corrected manually by the user. A user may not, however,
detect all of the errors and, so, the resulting database may include words which are not in the original text document.
[0005] Because neither OCR technology nor speech recognition technology is 100% error free, the scanned
database may contain words that were misspelled in the conversion process. The conventional method of using direct
searching techniques may not locate all of the appropriate documents in the database that contain a given query word
because the corresponding word in the database is misspelled in some of the documents.
[0006] Another problem with the conventional method of using direct searching techniques is that the size of the
database may be very large, and hence the search process may be very slow.

[0007] The conventional solution is insufficient to find words in the database that have been recognized incorrectly
and are misspelled. Also, the conventional solution is too slow in searching large databases. There is a need for a more
efficient search algorithm.



SUMMARY OF THE INVENTION

[0008] To meet this and other needs, and in view of its purposes, the present invention provides a simple and
effective method for searching for a query word in a hierarchical data structure. The data structure has branch nodes and
leaf nodes; each branch node represents a respective portion of one or more words and each leaf node represents a
word. The data structure is searched for each query word by selecting the first letter of the query word and also
selecting a root node in the hierarchical data structure as the current node. All possible child nodes of the current node are
identified. Respective estimated probability values for matching respective components of the query word with the
components associated with the nodes in the path taken through the hierarchical data structure are calculated for each
identified child node. The identified child nodes are then added to a list of candidate nodes. The candidate node with the
highest probability value is selected as the current node and is then deleted from the list of candidate nodes. If a leaf
node has been reached, then a determination is made whether to store the word into a list of best matches. Processing
repeats itself for each portion of the query word. When all portions of the query word have been matched, the matched
words with their respective probability values are stored into the list of best matches.

[0009] It is to be understood that both the foregoing general description and the following detailed description are
exemplary, but are not restrictive, of the invention.

BRIEF DESCRIPTION OF THE DRAWING 

[0010] The invention is best understood from the following detailed description when read in connection with the
accompanying drawing. Included in the drawing are the following figures:

Figure 1 is a high-level flow diagram of an exemplary embodiment of the present invention;
Figure 2 is a data structure diagram which illustrates principles of operation of systems using a trie data structure;
Figure 3 illustrates a tree representation of the trie of Figure 2; and



DETAILED DESCRIPTION OF THE INVENTION 

[0011] In order for retrieval systems in databases of automatically recognized text to function properly, each query
keyword is desirably retrieved using matching techniques that account for the existence of errors in the automatically
recognized text. In addition, it is desirable for the database to be indexed to speed up the search. Current search
algorithms, however, do not search the index efficiently. A new search algorithm is provided that cures some of the
deficiencies of other search techniques (e.g., depth-first).

[0012] Although the invention is described in terms of an exemplary embodiment which processes documents
containing words that were automatically recognized using optical character recognition and speech recognition, the
invention has more general application to processes in which an "alphabet" of symbols may be combined to form groupings
of symbols that are then converted to a machine-readable form using an automated process which may erroneously
recognize the groupings of symbols by substituting, inserting or deleting arbitrary symbols from any particular group.
[0013] Fig. 1 shows an overview of an exemplary method for searching for query words in an indexed database
according to the present invention. The search method is a computer program, which may reside on a carrier such as
a disk, diskette or modulated carrier wave. As shown in Fig. 1, in step 10, the database, stored in a trie data structure,
is indexed. Then, in decision step 20, each query input word is processed. In decision step 30, processing continues
until each letter of the input word is processed or the budget is exhausted. When processing is completed, step 80
stores the words that best match the input word along with their associated probabilities and processing returns to step
20 to process the next input word.

[0014] Returning to decision step 30, if processing has not completed then control passes to step 40. In step 40,
the candidate (child) nodes from the next level of the Trie are added to the list of candidate nodes (hereafter referred to
as the OPEN set). If the first letter of a word is being processed at step 40, the root node of the Trie is added to the
OPEN set.

[0015] Continuing with Fig. 1, in step 50, the probability f(n) for each newly added candidate node is calculated. In
the exemplary embodiment of the present invention, the probability is calculated initially using a simplified error model
(described below). The present invention provides a means for migrating from the simplified error model to a more
detailed error model (also described below) as sufficient training data becomes available.

[0016] In step 60, the OPEN set of candidate nodes is sorted by the probability f(n) that was calculated in step 50.
Next, in step 70, the node with the highest probability is visited and is deleted from the OPEN set. In decision step 80,
a check is made to determine if a leaf node has been reached on any of the branches of the Trie. Then, in step 90, a
determination is made whether to add the word that is associated with the leaf node to the list of best matches.
Processing returns to decision block 30.

[0017] The search strategy according to the present invention advantageously removes the prefix sensitivity of the
depth-first search method because it is capable of backtracking before following a branch of the Trie all the way to a
leaf node.

[0018] According to a further aspect of the present invention, the search method for the Trie may also provide for
matching in the presence of recognition errors. The errors in text generated using OCR or SR can be modeled as
framing errors of the following types: insertion, deletion, and m-n substitution errors. The following explains each error type
briefly and gives examples of their occurrences.

[0019] An insertion error occurs when extra letters are added into a word because of noise in a digitized document.
An example of an insertion error is when the word "airmail" is recognized as "airnmail." In this case, the letter n is
mistakenly inserted in the middle of the word by the scanner and OCR algorithm or by the microphone and the SR
algorithm.

[0020] A deletion error occurs when some letters in the original word are omitted from the automatically recognized
word. An example of a deletion error is when the word "airmail" is recognized as "aimail," i.e., the letter r is missing.
[0021] An m-n substitution error occurs when m letters in the original word are substituted by n other symbols in
the automatically recognized word. An example of a 1-2 substitution error is when the word "airmail" is seen by the
system as the word "airrnail." In this case, the letter m of the original word is substituted by the letters r and n in the
recognized word.
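
The three error types can be reproduced on the "airmail" examples with a few string operations. This is only an illustrative sketch; the helper names `insert_letter`, `delete_letter` and `substitute` are hypothetical and are not part of the described method.

```python
def insert_letter(word, pos, letter):
    """Insertion error: an extra letter appears at index pos."""
    return word[:pos] + letter + word[pos:]


def delete_letter(word, pos):
    """Deletion error: the letter at index pos is omitted."""
    return word[:pos] + word[pos + 1:]


def substitute(word, pos, m, replacement):
    """m-n substitution: m letters starting at pos become the n letters of replacement."""
    return word[:pos] + replacement + word[pos + m:]


original = "airmail"
inserted = insert_letter(original, 3, "n")   # extra "n" before the "m"
deleted = delete_letter(original, 2)         # the "r" is lost
split = substitute(original, 3, 1, "rn")     # 1-2 substitution: "m" read as "rn"
```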

[0022] In order to compute the likelihood that one word maps to another word, the invention uses one of two
models. These two models are the detailed and the simplified error models. Generally, the invention begins by using the
simplified error model and may migrate to the detailed model once sufficient training data becomes available to populate
the detailed model.

Figure 4 is an example of a trie data structure.

[0023] The detailed error model is described below. The detailed error model assumes that x, y, and z are letters
of the alphabet. The notation $p_a(y|x)$ is used to denote the probability that the letter x in the database word is
erroneously recognized as the letter y during the automatic recognition processes. In the detailed error model, the following
probabilities are defined ($\varepsilon$ stands for the null letter):

• $p_n(x)$: is the probability that the letter x is recognized properly without any errors during the recognition process,
i.e., $p_n(x) = p_a(x|x)$.

• $p_d(x)$: is the probability that the letter x is erroneously deleted from a word by the recognition process, i.e.,
$p_d(x) = p_a(\varepsilon|x)$.

• $p_i(x)$: is the probability that the letter x is erroneously inserted into a word by the recognition process, i.e.,
$p_i(x) = p_a(x|\varepsilon)$.

• $p_s(y|x)$: is the probability that the letter x is erroneously recognized as the letter y (a 1-1 substitution).

• $p_s(yz|x)$: is the probability that the letter x is erroneously recognized as the two letters yz (a 1-2 substitution).

• $p_s(z|xy)$: is the probability that the two letters xy are erroneously recognized as the letter z (a 2-1 substitution).

The values of these probabilities can be gathered statistically beforehand for each possible letter or combination of
letters using a large training database.

[0024] Because the detailed probability model calculates a large number of probability values for each letter in the
alphabet and for each possible pair of letters in the alphabet, it may be difficult to use. Accordingly, a simplified error
model is defined which is limited to only estimating the following probabilities:

• $p_n$: is the probability that no error occurs while recognizing a letter in a word.

• $p_i$: is the probability that an insertion error occurs while recognizing a letter in a word.

• $p_d$: is the probability that a deletion error occurs while recognizing a letter in a word.

• $p_{11}$: is the probability that a 1-1 substitution error occurs while recognizing a letter in a word.

• $p_{12}$: is the probability that a 1-2 substitution error occurs while recognizing a letter in a word.

• $p_{21}$: is the probability that a 2-1 substitution error occurs while recognizing a letter in a word.

Notice that these variables are independent of the specific letters and hence are much simpler to estimate, requiring a
much smaller training database than the detailed error model. On the other hand, the simplified error model is less
accurate than the detailed error model.

[0025] The edit distance measure is used to estimate a measure of similarity between two words. The cost of
mapping a query word $w_q$ to a database word $w_d$ is computed by determining the operations necessary to map $w_q$ to
$w_d$ and multiplying the probability values associated with all the operations. For example, the similarity measure between
the database word "art" and the query word "are" can be computed as $p_n(a) \times p_n(r) \times p_s(t|e)$ using the detailed error
model, or $p_n \times p_n \times p_{11}$ using the simplified error model.
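
As a small numerical illustration of the simplified-model product for "art" and "are", the sketch below multiplies the probabilities of the operations involved. The parameter values in `P` are hypothetical placeholders chosen for illustration; the source gives no numeric values for the simplified model.

```python
from math import prod

# Hypothetical simplified-model parameters (illustrative values only).
P = {"n": 0.9, "11": 0.04}


def similarity(ops):
    """Multiply the probabilities of the operations that map one word to another."""
    return prod(P[op] for op in ops)


# "art" vs "are": "a" matches, "r" matches, and the last letter is a 1-1 substitution.
score = similarity(["n", "n", "11"])
```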

[0026] In the case when there is more than one set of operations that map one word to another, dynamic
programming techniques may be used to find the set of operations with minimum cost (or with maximum likelihood). A dynamic
programming algorithm can be used to compare any two words, e.g., a query word against a database word.
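
A minimal sketch of such a dynamic programming comparison is given below, assuming the simplified error model and restricting the operations to exact match, 1-1 substitution, insertion and deletion (1-2 and 2-1 substitutions are left out for brevity). The function name and the default probability values are assumptions made for this example.

```python
def best_alignment_probability(query, db_word, p_n=0.9, p_11=0.04, p_i=0.02, p_d=0.02):
    """Maximum-likelihood alignment of query against db_word (simplified model)."""
    rows, cols = len(query) + 1, len(db_word) + 1
    # best[i][j]: best probability of mapping query[:i] onto db_word[:j]
    best = [[0.0] * cols for _ in range(rows)]
    best[0][0] = 1.0
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                # exact match or 1-1 substitution
                step = p_n if query[i - 1] == db_word[j - 1] else p_11
                candidates.append(best[i - 1][j - 1] * step)
            if i > 0:
                # query letter has no counterpart in the database word (deletion)
                candidates.append(best[i - 1][j] * p_d)
            if j > 0:
                # database word carries an extra letter (insertion)
                candidates.append(best[i][j - 1] * p_i)
            best[i][j] = max(candidates)
    return best[-1][-1]
```

For instance, `best_alignment_probability("are", "art")` picks the alignment with two matches and one 1-1 substitution, since a deletion plus an insertion would be less likely under these parameter values.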

[0027] It should be noted, though, that the method presented in the exemplary embodiment of the invention does
not use dynamic programming techniques directly because doing so would imply a sequential scan over the
entire document database. Accordingly, the new method may be viewed as hierarchically searching through the
document database and incrementally applying techniques similar to dynamic programming only to the portion of the
database where it is more likely to find words that are closely similar to the query word.

[0028] The trie structure used in the exemplary embodiment of the invention is an M-ary tree, the nodes of which
each have M entries, and each entry corresponds to a digit or a character of the alphabet. An example trie is given in
Figure 2 where the alphabet is the digits 0 . . . 9. Each node on level l of the trie represents the set of all keys that begin
with a certain sequence of l characters; the node specifies an M-way branch, depending on the (l + 1)st character.
Notice that in each node an additional null entry is added to allow for storing two numbers a and b where a is a prefix
of b. For example, the trie of Figure 2 can store the two words 91 and 911 by assigning 91 to the null entry of node A.
In this instance, 91 is stored as a branch node and 911 is stored as a leaf node.
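
The prefix case (91 and 911) can be sketched as follows. This is a simplified stand-in, not the M-entry node layout of Figure 2: each node keeps a dictionary of children, and a `word` field plays the role of the null entry. The names `TrieNode`, `insert` and `contains` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> TrieNode (sparse stand-in for the M entries)
        self.word = None     # plays the role of the null entry: a key that ends here


def insert(root, key):
    node = root
    for ch in key:
        node = node.children.setdefault(ch, TrieNode())
    node.word = key


def contains(root, key):
    node = root
    for ch in key:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.word == key


root = TrieNode()
for key in ("91", "911"):
    insert(root, key)
```

Here the node reached by "91" is a branch node (it has a child for the final "1") and also records the key 91, while the node reached by "911" is a leaf.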

[0029] The memory space of the trie structure can be significantly reduced (at the expense of running time) if, for
example, a linked list is used for each node, since most of the entries in the nodes tend to be empty. This embodiment
amounts to replacing the trie of Figure 2 by the forest of trees shown in Figure 3. Although the present invention is
described using a trie data structure, it may also be implemented with other hierarchical data structures. As described
below, the subject invention operates by traversing a single tree. Where the hierarchical data structure includes a forest
of trees, it may be desirable to add a fictional root node to the data structure to simplify the processing of the data
structure by the algorithm.

[0030] Searching for a word in the trie proceeds as follows. The search starts at the root and looks up the first letter,
and then follows the pointer next to the letter and looks up the second letter in the word in the same way. On the other
hand, searching in the forest version of the trie proceeds by finding the root node and matching the first letter in the
query word, then finding the child node of that root which matches the second letter, etc. It can be shown that the
average search time for words stored in the trie is $\log_M N$ and that the "pure" trie requires a total of approximately $N/\ln M$
nodes to distinguish between N random words. Hence the total amount of space is $MN/\ln M$.
[0031] There is a major difficulty with the trie data structure when used in conjunction with databases that contain
errors. The trie data structure is prefix-sensitive. In the process of matching, when using depth-first search and
backtracking, there is a tendency towards correcting mismatches that may have occurred at the deeper levels of the trie
before correcting errors that may have occurred at higher levels. On the other hand, using conventional search
techniques, the correcting of mismatches that may have possibly occurred at the higher levels of the trie (e.g., the root) is
typically performed only after exhausting possible matches that are deeper in the trie.

[0032] For example, consider the trie of Figure 4, and assume that the search is for the query word "aetna." Assume
further that at document recognition time, the word "aetna" was erroneously recognized and stored into the database
as the word "getna." The nodes are labeled by the probability of accepting the corresponding letter of the query word
using the detailed error model given in Section 2.1. For example, from Figure 4, the probability of recognizing the letter
"e" of the word "aetna" as being the letter "r" (i.e., a 1-1 substitution error) is 0.001. The overall probability of accepting a
word is computed as the product of all the accepting probabilities from the root to the leaf where the word is stored. For
example, from Figure 4, the overall probability of matching the query word "aetna" with the sequence "a", "c", "o", "r",
"n" in the trie is $0.8 \times 0.0015 \times 0.0005 \times 0.01 \times 0.01 = 0.6 \times 10^{-10}$. All the leaf nodes in the Figure are labeled with the
overall accepting probabilities. Using a depth-first search algorithm, and given the query word "aetna," the node a is
visited first since the word in the database was erroneously recognized as "getna" instead of "aetna." According to the
depth-first search algorithm, the trie paths will be searched in the following order: acorn, armor, arena, and finally
getna. Alternatively, notice that the order implied by the overall probabilities of the words is getna, arena, acorn, and
armor. Moreover, if there are some additional limitations on the search time, the algorithm may run out of the allowable
search time before it reaches the correct word (getna).

[0033] The above example demonstrates the sensitivity of the depth-first search algorithm to the decisions
performed by the algorithm at the levels in the trie that are close to the root. The present invention provides a new
algorithm, based on the A* graph search algorithm, that reduces the problems demonstrated above.

[0034] The new search algorithm uses a variant of the A* graph search algorithm to overcome prefix sensitivities of
the trie. It can be used for matching words that may contain some errors in them.

[0035] An evaluation function f is computed at every visited node n during the search. f(n) is defined such that it
provides an estimate of the overall probability of matching the query word with the words in the leaves of the subtree
rooted at n. The value of f(n) determines the order by which the search algorithm visits the nodes of the trie. This is in
contrast to visiting the nodes in a last-in first-out manner when using a depth-first-like algorithm.
[0036] A list termed OPEN is maintained that stores all the candidate nodes that are visited during the search.
Nodes in OPEN are sorted in decreasing order by their f value. Initially this list includes only the root node of the trie or
only the root nodes of the forest of the trees.

[0037] A path from the root to a leaf node defines a word in the database. Based on the A* search algorithm, and
in the context of word matching, for an input word $s_{in}$, an optimal path is defined to be the path from the root node to
the leaf that contains the database word having the highest accepting probability over the probabilities
of all the other words in the database. For each node n, $f^*(n)$ is the highest possible matching probability for all the
words in the database with paths passing through node n. This corresponds to a path passing from the root r to n and
finally to the leaf node containing the word that attains this probability. From this definition of $f^*$, it can be concluded
that, for the trie, the optimal match for an input word is $f^*(r)$ where r is the root of the trie, since r is in the path of all
the words in the database.
[0038] Because it is very time-consuming to visit all the words in the database that descend from a given node n,
f(n) is defined to be an estimate for the optimal value of $f^*(n)$. Furthermore, f(n) is expressed in two parts: $g^*(n)$, which
is the actual overall probability of the path from the root node r to node n (there is only one such path in the case of the
trie and hence this path is also an optimal one), and h(n), which is an estimate of the overall probability of the path from
n to the leaf node of the word with the highest overall probability among all the words in the leaf nodes of the subtree
rooted at n. Therefore,

$f(n) = g^*(n) \times h(n)$.

The value of f(n) is then an estimate of the overall probability of an optimal path from r constrained to go through node
n.

EP 1 052 576 A2

[0039] Choosing h(t) = 1 as an upper bound (since the probability is always less than or equal to one) results in a
breadth-first search. The goal is to find tighter upper bounds that help speed up the A* search.
[0040] The maximum probability can be computed over all letters in the following way: Let L be the length of the
query word, and let $d_t$ be the depth of a node t, where the root of the trie is at depth 0, i.e., $d_t$ is the length of the path
from the root of the trie to node t. Also, let $s_{in} = a_1 a_2 \ldots a_L$ be the input word. Then, at depth $d_t$ of the trie, the part
of the query word that is processed so far is a prefix of $s_{in}$ of length $d_t$, and can be expressed as $a_1 a_2 \ldots a_{d_t}$.
The maximum possible h(t) can be attained if all of the remaining letters of $s_{in}$ (i.e., $a_{d_t+1} \ldots a_L$) are matched by other
letters that result in the highest possible probabilities, i.e., each of the remaining letters matches without any error.
Therefore, h(t) can be estimated as

$h(t) = \prod_{i=d_t+1}^{L} p_n(a_i)$

for the detailed error model, while for the simplified error model,

$h(t) = p_n^{L-d_t}$.
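
The two estimates of h(t) can be sketched as below. The function names and the representation of the detailed model as a per-letter dictionary `p_n` are assumptions made for illustration; `depth` stands for $d_t$.

```python
from math import prod


def h_detailed(query, depth, p_n):
    """Upper bound on the rest of the match: every remaining letter matches without error.

    p_n maps each letter to its (hypothetical) error-free recognition probability.
    """
    return prod(p_n[ch] for ch in query[depth:])


def h_simplified(query, depth, p_n):
    """Same bound when a single letter-independent probability p_n is used."""
    return p_n ** (len(query) - depth)
```

At the root (depth 0) the bound covers the whole query word, and at depth L it equals 1, so f reduces to the actual path probability once the entire query has been consumed.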






[0041] As described previously, the new search algorithm can handle the following types of errors: insertion,
deletion, and m-n substitution errors. In order to accommodate framing errors, the elements of the list OPEN are in the form
of the 5-tuples (error-type, node, word-location, g(node), h(node)). The following describes the elements of OPEN:

• error-type can have one of the following assignments:

1. m - an error-free match,

2. d - a deletion error,

3. i - an insertion error,

4. $s_{mn}$ - an m-n substitution error.

• node is the node of the trie to be visited when the 5-tuple is to be processed by the algorithm.

• word-location is the location in the query word where the search will resume when the 5-tuple is to be processed
by the algorithm.

• g(node) is the probability of matching the prefix of the query word that has been processed so far with the nodes
of the trie that correspond to the path from the root of the trie to the current node.

• h(node) is the estimated probability of matching the remaining portion of the query word with the nodes of the trie
that are descendants in the subtree rooted at node.
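
One possible way to hold these 5-tuples is sketched below, using a binary heap so that the entry with the highest f = g x h is always retrieved first. The class and function names (`OpenEntry`, `push`, `pop_best`) are hypothetical, and the heap is a substitute for the sorted list described in the text.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class OpenEntry:
    neg_f: float                                # heapq is a min-heap, so store -f
    error_type: str = field(compare=False)      # "m", "d", "i", "s11", ...
    node: object = field(compare=False)         # trie node to visit
    word_location: int = field(compare=False)   # index in the query word to resume from
    g: float = field(compare=False)             # probability of the path so far
    h: float = field(compare=False)             # estimate for the remaining letters


def push(open_list, error_type, node, word_location, g, h):
    heapq.heappush(open_list, OpenEntry(-(g * h), error_type, node, word_location, g, h))


def pop_best(open_list):
    """Remove and return the entry with the highest f = g * h."""
    return heapq.heappop(open_list)
```

For example, after pushing entries with g values 0.9, 0.02 and 0.04 and the same h, `pop_best` returns the entry whose g is 0.9.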

[0042] The overall evaluation function f(node) is computed as the product of g(node) and h(node). The 5-tuples in
OPEN are sorted according to the values of f(n). At each step of the algorithm, the tuple with the highest possible f(n)
is selected from OPEN and is visited by the algorithm. The query word is represented by the letters $a_1 a_2 \ldots a_L$, where
L is the number of letters in the query word. Assume that the algorithm is at node n, that the query word is matched
up to letter $a_{l-1}$ (i.e., that the remaining portion of the word that still needs to be matched is $a_l \ldots a_L$), and that the
evaluation functions are g = g(n) and h = h(n), i.e., the algorithm is processing the tuple $(*, n, a_l \ldots a_L, g = g(n), h = h(n))$.
The algorithm handles framing errors in the following way. Note that, in the discussion below, only 1-1, 1-2 and 2-1
substitution errors are described. The more general m-n form can be treated in exactly the same way (the case of m > n is
treated in the same way as the 2-1 case, while the case of m < n is treated in the same way as the 1-2 case).
[0043] The following outline describes how the invention handles the errors:

1. No errors (or exact matching): The algorithm inserts the following tuple into the set OPEN:

For the detailed error model,

$(m,\ child(n,s),\ a_{l+1} \ldots a_L,\ g = g(n) \times p_n(a_l),\ h = \prod_{i=l+1}^{L} p_n(a_i))$,

while for the simplified error model,

$(m,\ child(n,s),\ a_{l+1} \ldots a_L,\ g = g(n) \times p_n,\ h = p_n^{L-l})$,

where child(n,s) is the child node of n that corresponds to the letter s, g is the probability of the optimal path
from the root to node n, and h is an estimate of the probability of matching the rest of the input word with the
nodes of the trie.

2. 1-1 substitution errors (i.e., no framing errors): For each letter s in node n, the algorithm computes the
probability $p_s(s|a_l)$ and inserts the following tuple into the set OPEN:

For the detailed error model,

$(s_{11},\ child(n,s),\ a_{l+1} \ldots a_L,\ g = g(n) \times p_s(s|a_l),\ h = \prod_{i=l+1}^{L} p_n(a_i))$,

while for the simplified error model,

$(s_{11},\ child(n,s),\ a_{l+1} \ldots a_L,\ g = g(n) \times p_{11},\ h = p_n^{L-l})$.

3. Deletion error: In this case, it is assumed that the current letter in the query word (the letter $a_l$) has no
corresponding letter in the database word, as $a_l$ was deleted from the database word during the scanning process. The
search should resume from the next letter of the query word.

As a result, for each letter s in node n, the algorithm inserts the following tuple into the set OPEN:

For the detailed error model,

$(d,\ n,\ a_{l+1} \ldots a_L,\ g = g(n) \times p_d(a_l),\ h = \prod_{i=l+1}^{L} p_n(a_i))$,

while for the simplified error model,

$(d,\ n,\ a_{l+1} \ldots a_L,\ g = g(n) \times p_d,\ h = p_n^{L-l})$.

4. Insertion error: In this case, it is assumed that there is an extra letter being added into the database word during
the scanning process, just prior to the position of the current letter in the query word. This is treated by skipping the
current node (node n), and matching the current letter with all of the letters that appear in the child nodes of node
n. As a result, for each child node u of n, and for each letter s in node u, the algorithm inserts the following tuple
into the set OPEN:

For the detailed error model,

$(i,\ child(u,s),\ a_{l+1} \ldots a_L,\ g = g(n) \times p_i(s),\ h = \prod_{i=l+1}^{L} p_n(a_i))$,

while for the simplified error model,

$(i,\ child(u,s),\ a_{l+1} \ldots a_L,\ g = g(n) \times p_i,\ h = p_n^{L-l})$.

5. 2-1 substitution error: In this case, it is assumed that two letters of the query word were erroneously mapped to
one letter that was inserted in the database instead of the two letters. This took place during the recognition
process of the database word. This error type is treated by matching both the current and next letter of the query word
with the current letter in the database node.

As a result, the algorithm takes the following actions: for each letter s in node n, the algorithm inserts the following
tuple into the set OPEN:

For the detailed error model,

$(s_{21},\ child(n,s),\ a_{l+2} \ldots a_L,\ g = g(n) \times p_s(s|a_l a_{l+1}),\ h = \prod_{i=l+2}^{L} p_n(a_i))$,

while for the simplified error model, by analogy with the other cases,

$(s_{21},\ child(n,s),\ a_{l+2} \ldots a_L,\ g = g(n) \times p_{21},\ h = p_n^{L-l-1})$.

6. 1-2 substitution error: In this case, it is assumed that the current and next letters of the database word
correspond to one letter that was split into two letters during the recognition process and then was inserted into the
database. In other words, the query word has the correct letter while the database word has two letters. This error type
is treated by matching all the pairs of letters in the current node and its children nodes with the current letter of the
query word.

The algorithm performs the following actions: for each letter s in node n, and for each letter s' that is in a child node
u of n, the algorithm computes the probability $p_s(ss'|a_l)$ and inserts the following tuple into the set OPEN:

For the detailed error model,

$(s_{12},\ child(u,s'),\ a_{l+1} \ldots a_L,\ g = g(n) \times p_s(ss'|a_l),\ h = \prod_{i=l+1}^{L} p_n(a_i))$,

while for the simplified model,

$(s_{12},\ child(u,s'),\ a_{l+1} \ldots a_L,\ g = g(n) \times p_{12},\ h = p_n^{L-l})$.
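
The expansion rules above can be combined into a compact best-first search. The following is only a sketch under stated assumptions, not the patented procedure itself: it uses the simplified error model with hypothetical default probabilities, it covers exact match, 1-1 substitution, deletion and insertion only (1-2 and 2-1 substitutions are omitted), and it handles an insertion by consuming one trie letter without advancing in the query word, which is a simplification of the insertion rule given above. The names `Node`, `build_trie` and `search` are illustrative.

```python
import heapq
from itertools import count


class Node:
    def __init__(self):
        self.children = {}   # letter -> Node
        self.word = None     # set when a database word ends at this node


def build_trie(words):
    root = Node()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, Node())
        node.word = w
    return root


def search(root, query, k=2, budget=10_000, p_n=0.9, p_11=0.04, p_i=0.02, p_d=0.02):
    """Best-first search; returns up to k (word, probability) pairs, best first."""
    L = len(query)
    tie = count()            # tie-breaker so the heap never compares Node objects
    open_list = []

    def push(kind, node, loc, g):
        h = p_n ** (L - loc)                      # simplified-model estimate
        heapq.heappush(open_list, (-(g * h), next(tie), kind, node, loc, g))

    push("m", root, 0, 1.0)
    best_matches, seen = [], set()
    while open_list and budget > 0 and len(best_matches) < k:
        budget -= 1
        _, _, _, node, loc, g = heapq.heappop(open_list)
        if loc == L and node.word is not None and node.word not in seen:
            seen.add(node.word)
            best_matches.append((node.word, g))
        if loc < L:
            push("d", node, loc + 1, g * p_d)     # query letter absent from database word
        for s, child in node.children.items():
            if loc < L:
                kind, p = ("m", p_n) if s == query[loc] else ("s11", p_11)
                push(kind, child, loc + 1, g * p)
            push("i", child, loc, g * p_i)        # extra letter in the database word
    return best_matches


root = build_trie(["acorn", "armor", "arena", "getna"])
matches = search(root, "aetna")
```

With these illustrative parameters, the words of the "aetna" example are returned in order of decreasing probability, so "getna" is reported before "arena" even though it lies on a different branch from the root.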

[0044] The values of $p_i$, $p_d$, $p_{11}$, $p_{12}$ and $p_{21}$ reflect the probabilities that the word scanned into the database
contains an insertion, deletion or a substitution error. These probability values may also be viewed as parameters that can
be tuned to direct the search. The exact values of these probabilities are scanner dependent and hence are difficult to
quantify in general terms.

[0045] These values may be estimated by, for example, using results from studies that empirically measure the
frequencies of occurrence of these types of errors for different types of scanners. Using these values as initial guesses,
the system may be incrementally trained by updating the values of these parameters each time a match is performed
by the system. An alternative method of calculating the probabilities is to perform a separate training session and then
count the number of errors of each type that occur in the session. These error counts are then used to estimate the
desired probability values.
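The counting approach can be illustrated with a short Python sketch. The source does not specify how a training session is recorded, so the example assumes each aligned letter position has already been labelled with one error type. The labels in `ERROR_TYPES`, the helper `estimate_error_probabilities` and the additive smoothing term are hypothetical.

```python
from collections import Counter

# Assumed labels for the outcome observed at each aligned letter position.
ERROR_TYPES = ("match", "insertion", "deletion", "substitution")


def estimate_error_probabilities(events, smoothing=1.0):
    """Estimate each probability as a relative frequency of the labelled events.

    Additive smoothing (an assumption, not taken from the source) keeps an
    error type that never occurred in training from receiving probability zero.
    """
    counts = Counter(events)
    total = len(events) + smoothing * len(ERROR_TYPES)
    return {e: (counts[e] + smoothing) / total for e in ERROR_TYPES}
```

Calling the helper with `smoothing=0.0` on a non-empty event list gives the plain relative frequencies; the returned values always sum to one.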

[0046] The invention maintains the following data structures: a list or queue OPEN and the list best_matches. The
input to the method is a query word $S_{in}$ and an integer variable budget that controls the run time of the method. The
variable budget is decremented with each iteration of the method.
[0047] Each time the entire query word is matched by a word in the trie, the matched word and its matching
probability are stored in the sorted list best_matches. The number of words that can be kept in best_matches
at a given time can be limited by a constant, say k. At the end of the searching process, the k words in best_matches
can be reported as the best k matches found so far.
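A bounded best_matches list of this kind might be kept as in the sketch below. The `BestMatches` class and its method names are hypothetical. The example also makes one interpretive assumption: a new word enters a full list when its probability exceeds the lowest probability currently stored, which then drops out.

```python
import heapq


class BestMatches:
    """Keeps at most k (probability, word) pairs.

    A min-heap keeps the weakest entry at index 0, so it is cheap to evict.
    """

    def __init__(self, k):
        self.k = k
        self._heap = []

    def offer(self, word, prob):
        """Insert the word if there is room or it beats the weakest entry."""
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (prob, word))
        elif prob > self._heap[0][0]:
            heapq.heapreplace(self._heap, (prob, word))

    def ranked(self):
        """Return (word, probability) pairs, most probable first."""
        return [(w, p) for p, w in sorted(self._heap, reverse=True)]
```

The sketch does not merge duplicates, so a word reached along two different paths would be stored twice; a fuller version would keep only the higher probability.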

[0048] The outline of the method is given below. The method is outlined using the detailed error model. However,
the method can be modified easily to use the simplified error model. Pseudo code for this method is shown in the
following table.



TABLE

1. Start at the root r of the trie.

2. Store $\langle m,\ r,\ a_1 a_2 \ldots a_L,\ g = 1,\ h = \prod_{i=1}^{L} p_m(a_i) \rangle$ into the list OPEN.

3. Loop until budget is exhausted or no more nodes in OPEN:

   (a) tuple T ← top element of OPEN

   (b) $n_T$ ← node(T)

   (d) if $n_T$ has no children (i.e., is a leaf node), then perform leaf-node action:

       i. $S_T$ ← word associated with leaf node $n_T$

       ii. if the query word still has more unmatched characters (i.e., L > depth($n_T$)), or even if
           length($S_T$) ≠ length($S_{in}$), then the extra letters are considered deleted from the query
           word, and hence [expression missing in source]

       iii. otherwise, compute P($S_T$), the entire probability of matching $S_T$.

       iv. if the number of words in best_matches < k, or if P($S_T$) is higher than any of the words in
           best_matches, insert $S_T$ into best_matches. This may result in excluding the word in
           best_matches that has the lowest matching probability.

   (e) expand node $n_T$ ($n_T$ is a non-leaf node):

       if all letters of $S_{in}$ are consumed, then skip $n_T$, else

       i. $a_t$ ← next letter in $S_{in}$

       ii. decrease budget: budget ← budget - 1

       iii. for all letters s in $n_T$, insert the following tuples into OPEN:

$$\langle m,\ \mathrm{child}(n,s),\ a_{t+1} \ldots a_L,\ g = g(n) \times p_m(a_t),\ h = \prod_{i=t+1}^{L} p_m(a_i) \rangle,$$

$$\langle s_{11},\ \mathrm{child}(n,s),\ a_{t+1} \ldots a_L,\ g = g(n) \times p_s(s \mid a_t),\ h = \prod_{i=t+1}^{L} p_m(a_i) \rangle,$$

$$\langle d,\ n,\ a_{t+1} \ldots a_L,\ g = g(n) \times p_d(a_t),\ h = \prod_{i=t+1}^{L} p_m(a_i) \rangle,$$

$$\langle s_{21},\ \mathrm{child}(n,s),\ a_{t+2} \ldots a_L,\ g = g(n) \times p(s \mid a_t a_{t+1}),\ h = \prod_{i=t+2}^{L} p_m(a_i) \rangle.$$

       iv. for each child u of $n_T$, insert the following tuples into OPEN:

$$\langle i,\ \mathrm{child}(u,s),\ a_{t+1} \ldots a_L,\ g = g(n) \times p_i(s),\ h = \prod_{i=t+1}^{L} p_m(a_i) \rangle,$$

$$\langle s_{12},\ \mathrm{child}(u,s),\ a_{t+1} \ldots a_L,\ g = g(n) \times p(s s_j \mid a_t),\ h = \prod_{i=t+1}^{L} p_m(a_i) \rangle.$$

       v. reorder the nodes in OPEN in descending order based on their values of f = g × h.
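The loop in the Table can be sketched as a best-first search in Python. This is a simplified illustration under stated assumptions, not the method as claimed: it uses the simplified error model with constant probabilities `p_m`, `p_s`, `p_i` and `p_d` (values chosen arbitrarily), omits the 2-1 and 1-2 substitution tuples for brevity, decrements the budget once per tuple taken from OPEN, treats any node that ends a stored word as a leaf, and charges `p_d` for each query letter left unmatched at a leaf. The names `TrieNode`, `build_trie` and `search` are hypothetical.

```python
import heapq
from itertools import count


class TrieNode:
    def __init__(self):
        self.children = {}  # letter -> TrieNode
        self.word = None    # set on nodes that end a stored word


def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def search(root, query, k=3, budget=1000, p_m=0.9, p_s=0.04, p_i=0.03, p_d=0.03):
    L = len(query)
    tie = count()   # tie-breaker so the heap never has to compare nodes
    open_list = []  # OPEN, ordered by descending f = g * h

    def push(node, t, g):
        f = g * p_m ** (L - t)  # h assumes the remaining letters all match
        heapq.heappush(open_list, (-f, next(tie), node, t, g))

    push(root, 0, 1.0)
    best = {}  # word -> highest matching probability found so far
    while open_list and budget > 0:
        _, _, node, t, g = heapq.heappop(open_list)
        budget -= 1
        if node.word is not None:
            # Leaf-node action: unmatched query letters count as deleted.
            p = g * p_d ** (L - t)
            best[node.word] = max(p, best.get(node.word, 0.0))
        if t >= L:
            continue  # all query letters consumed: skip expansion
        a_t = query[t]
        for s, child in node.children.items():
            # Exact match or 1-1 substitution consumes one query letter.
            push(child, t + 1, g * (p_m if s == a_t else p_s))
            # Insertion: the database word has an extra letter s.
            push(child, t, g * p_i)
        # Deletion: query letter a_t is missing from the database word.
        push(node, t + 1, g * p_d)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

Calling `search(build_trie(["cat", "cot", "dog"]), "cat")` ranks `cat` first. A smaller `budget` trades completeness for speed, since the most promising tuples are expanded first.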

[0049] Where a forest of trees is used instead of a trie, a fictional root node, corresponding, for example, to the null
character, may be added to the forest of trees prior to step 1 shown in the Table. This converts the forest of trees into a
single tree having a single root node.
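Adding the fictional root is a small operation, sketched here in Python. The `TreeNode` class and the `add_fictional_root` helper are hypothetical, and the example assumes each node stores its letter and a list of children.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    letter: str
    children: list = field(default_factory=list)


def add_fictional_root(forest):
    """Return a single tree whose root carries the null character
    and adopts the root of every tree in the forest as a child."""
    return TreeNode(letter="\0", children=list(forest))
```

The search can then start from the returned node exactly as it would from the root of a trie.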

[0050] The method shown in the Table is an alternative representation of the flow chart shown in Figure 1.
[0051] Although illustrated and described herein with reference to certain specific embodiments, the present
invention is nevertheless not intended to be limited to the details shown. Rather, various modifications may be made in the
details within the scope and range of equivalents of the claims and without departing from the spirit of the invention.

Claims 

1. A method of searching for a query word from among a plurality of words in a hierarchical data structure having
branch nodes and leaf nodes, each branch node representing a respective portion of one or more of the words and
each leaf node representing a respective one of the words, the method comprising the steps of:

a) selecting a root node in the hierarchical data structure as the current node;

b) identifying all possible child nodes of the current node in the hierarchical data structure;

c) calculating, for each of the identified child nodes, a respective estimated probability value for matching
each component of the query word with the component associated with a respective one of the branch nodes
in a path taken in the hierarchical data structure from the root node to the current node;

d) adding the identified child nodes to a list of candidate nodes;

e) selecting, from the list of candidate nodes, one node having the respective estimated probability value which
is greater than any other probability value as the current node;

f) determining if the current node is a leaf node and, if so, then determining whether to store the word
representing the leaf node into a list of best matches; and

g) repeating steps (b) through (g) until all components of the query word have been processed.

2. A method as defined in claim 1 wherein, the hierarchical data structure is a trie data structure representing a
plurality of words wherein the trie data structure has a plurality of branch nodes and each branch node includes at
least one child node, and the step (c) includes the step of determining a probability that a next portion of the query
word matches each child node of the branch node.

3. A method as defined in claim 2, wherein, the trie data structure has N levels, ordinally numbered 0 through N-1 and
step (d) includes for each candidate node, the step of:

calculating the estimated probability value associated with the selected element, according to the equation:

$$f(n) = g^*(n) \times h(n)$$

wherein f(n) is an estimate of the overall probability that the components of the query word match the components
associated with respective ones of the branch nodes in an optimal path from a root node constrained to go through
node n, g*(n) is the probability that respective components of the query word match the components associated
with respective ones of the branch nodes in the path from the root node to the node n and h(n) is the overall
probability that respective components of the query word match the components associated with respective ones of the
branch nodes in the path from the node n to the leaf node of the word with a probability that is greater than the
probability of any other words associated with the leaf nodes rooted at the root node.

4. A method as defined in claim 1 further comprising the steps of:

establishing a data structure which includes a plurality of entries, wherein each entry consists of a plurality of
elements for each of the nodes in the hierarchical data structure, each entry containing elements identifying:

(1) a respective type of error condition that possibly results in a predetermined query sequence occurring;
and

(2) information identifying an estimated probability that the predetermined query sequence matches any
of the plurality of words represented by the respective paths in the hierarchical data structure that pass
through the child node associated with the entry, if the identified type of error condition is present,

wherein, step (e) includes selecting the plurality of elements from among the elements with which the entries in the
data structure are associated.

5. A method as defined in claim 4 wherein, the types of error conditions include error free matches, insertion errors,
deletion errors and substitution errors.



6. A method as defined in claim 1 further including the steps of:

assigning a search time budget; and

decrementing the search time budget with each execution of steps (c) through (g), and wherein, the selecting
of new components in the query word in step (g) is inhibited when the search time budget has been exhausted.

7. A method as defined in claim 1 wherein, the list of best matches has a predetermined maximum number of entries
and, after exceeding the predetermined number, the word in the list of best matches having the probability value
which is less than any other probability value in the list of best matches is deleted from the list of best matches.

8. A computer readable medium encoded with a computer program which, when executed, causes a computer to
search for a query word from among a plurality of words in a hierarchical data structure having branch nodes and
leaf nodes, each branch node representing a respective portion of one or more of the plurality of words and each
leaf node representing a respective one of the plurality of words, the computer program causing the computer to
perform the steps of:

a) selecting a root node in the hierarchical data structure as the current node;

b) identifying all possible child nodes of the current node in the hierarchical data structure;

c) calculating, for each of the identified child nodes, a respective estimated probability value for matching
each component of the query word with the component associated with a respective one of the branch nodes
in a path taken in the hierarchical data structure from the root node to the current node;

d) adding the identified child nodes to a list of candidate nodes;

e) selecting, from the list of candidate nodes, one node having the respective estimated probability value which
is greater than any other probability value as the current node;

f) determining if the current node is a leaf node and, if so, then determining whether to store the word
representing the leaf node into a list of best matches; and

g) repeating steps (b) through (g) until all letters of the query word have been matched.



9. A computer readable medium according to claim 8 wherein the computer program further causes the computer to
perform the following steps:

establishing a data structure which includes a plurality of entries, wherein each entry consists of a plurality of
elements for each of the nodes in the hierarchical data structure, each entry containing elements identifying:

(1) a respective type of error condition that possibly results in a query sequence occurring; and

(2) information identifying an estimated probability that the query sequence matches any of the plurality of
words represented by the respective paths in the hierarchical data structure that pass through the child
node associated with the entry, if the identified type of error condition is present,

wherein, step (b) includes selecting the plurality of elements from among the elements with which the entries in the
data structure are associated.

10. A computer readable medium according to claim 8 wherein, the hierarchical data structure is a trie data structure
representing a plurality of words wherein, the trie data structure has a plurality of branch nodes, wherein, each
branch node includes at least one child node, wherein the computer program, at step (c) causes the computer to
perform the step of determining a probability that a next portion of the query word matches each child node of the
branch node.

11. A computer readable medium according to claim 10 wherein, the trie data structure has N levels, ordinally
numbered 0 through N-1 and step (d) includes for each candidate node, further causing the computer to perform the
step of:

calculating the estimated probability value associated with the selected element, according to the equation:

$$f(n) = g^*(n) \times h(n)$$

wherein f(n) is an estimate of the overall probability that the components of the query word match the components
associated with respective ones of the branch nodes in an optimal path from a root node constrained to go through
node n, g*(n) is the probability that respective components of the query word match the components associated
with respective ones of the branch nodes in the path from the root node to the node n and h(n) is the overall
probability that respective components of the query word match the components associated with respective ones of the
branch nodes in the path from the node n to the leaf node of the word with a probability that is greater than the
probability of any other words associated with the leaf nodes rooted at the root node.


